Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target.

Data Dictionary

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries

In [ ]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note:

  1. After running the above cell, restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), then write the relevant code for the project from the next cell onward and run all cells sequentially.

  2. On executing the above cell, you might see a warning regarding package dependencies. This warning can be ignored, as the command installs the library versions needed to run the code in this notebook successfully.

In [1]:
# to load and manipulate data
import pandas as pd
import numpy as np

# to visualize data
import matplotlib.pyplot as plt
import seaborn as sns

# to split data into training and test sets
from sklearn.model_selection import train_test_split

# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# to tune different models
#from sklearn.model_selection import GridSearchCV

# to compute classification metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
)

# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

Loading the dataset

In [2]:
# run the following lines only if using Google Colab; this mounts your Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
# Load the data
loan_data = pd.read_csv('/content/drive/MyDrive/Python Course/Loan_Modelling.csv')
In [4]:
# Make a copy of the data; keep the original as a backup
data = loan_data.copy()

Data Overview

View the first and last 5 rows of the data

In [5]:
# View the first 5 rows of the data
data.head()
Out[5]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [6]:
# View the last 5 rows of the data
data.tail()
Out[6]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

Checkout the number of rows and columns of the data

In [7]:
# Shape of the data
data.shape
Out[7]:
(5000, 14)
  • The dataset has 5000 rows and 14 columns.

Check the attributes/data types of the columns

In [8]:
# View the columns and datatypes of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
  • All columns are numerical.
  • Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard are all Yes/No fields, encoded as 0 or 1.

Check the statistical summary

In [9]:
# Check the statistical summary of the data
# data.describe(include="all")
data.describe().T
Out[9]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
  • The average age of customers is ~45 years.
  • The average years of professional experience of customers is ~20 years.
  • The average income of customers is ~74K dollars
  • The middle 50% of customers have a family size between 1 and 3 people.
  • The average spending on credit cards per month of customers is ~2K dollars.
  • The average value of the house mortgage for customers is ~56K dollars
  • Most customers did not accept the personal loan the last time it was offered.
  • At least half of the customers use internet banking facilities

  • NOTE: The Experience column has an anomaly: its minimum value is -3, a negative number of years. This needs to be investigated further.

Check For Missing Data

In [10]:
# Check for null values
data.isnull().sum()
Out[10]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0

  • There is no missing data. Note: this does not mean all values are valid, so further exploration of each column is required.
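The kind of validity check meant here can be sketched by scanning numeric columns for negative values. The mini-frame below is hypothetical; in the notebook, the same scan would run on `data`:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the loan data
df = pd.DataFrame({"Age": [25, 45], "Experience": [1, -3], "Income": [49, 34]})

# Nulls are absent, but a value can still be invalid, e.g. negative Experience
negatives = (df.select_dtypes("number") < 0).sum()
print(negatives[negatives > 0])  # flags the Experience column
```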

Check For Duplicate Data

In [11]:
# Checking for duplicate values
data.duplicated().sum()
Out[11]:
0
  • There is no duplicate data

Check For And Drop Unnecessary Columns

In [12]:
# Check to see if the ID column is unique
data.ID.nunique()
Out[12]:
5000
In [13]:
# Drop the ID column. All values are unique.
data.drop(columns=["ID"], inplace=True)
In [14]:
# Check to confirm that the ID column has been dropped
data.head()
Out[14]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 35 8 45 91330 4 1.0 2 0 0 0 0 0 1

Deal With Anomalous Data

In [15]:
# Do a check for anomalies in each column data
#data["Experience"].unique()
#data["Online"].unique()
#data["CreditCard"].unique()
#data["Family"].unique()
#data["Education"].unique()
#data["CCAvg"].unique()
#data["Mortgage"].unique()
#data["Income"].unique()
#data["Age"].unique()
#data["CD_Account"].unique()
#data["Securities_Account"].unique()
#data["Personal_Loan"].unique()
#data["ZIPCode"].unique()

# NOTE: Only Experience has anomalies, as noted earlier (negative values)

data["Experience"].unique()
Out[15]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, -1, 34,  0, 38, 40, 33,  4, -2, 42, -3, 43])
In [16]:
# Check for anomalous data in the Experience column
data["Experience"].unique() # Found -1, -2, -3

# View this negative values data
data[data["Experience"] < 0]["Experience"].unique()

# Replace the negative values with their positive counterparts
data["Experience"].replace(-1, 1, inplace=True)
data["Experience"].replace(-2, 2, inplace=True)
data["Experience"].replace(-3, 3, inplace=True)

# Check to confirm that the -ve values have been removed
data["Experience"].unique()
Out[16]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, 34,  0, 38, 40, 33,  4, 42, 43])
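Since the only negatives are -1, -2, and -3, the same cleanup can be written more compactly with `.abs()`; a minimal sketch on toy values:

```python
import pandas as pd

# Toy Experience values mirroring the anomaly; .abs() flips -1/-2/-3 to 1/2/3
experience = pd.Series([1, -1, -2, -3, 19]).abs()
print(sorted(experience.unique()))  # no negatives remain
```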

Exploratory Data Analysis

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

1. What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?

- Answer: The Mortgage data is heavily right-skewed and there are quite a lot of outliers on the higher side of the data.
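The skewness claim can be quantified with `Series.skew()`; a sketch on a hypothetical mortgage-like series (mostly zeros with a few large values, echoing the real column's shape):

```python
import pandas as pd

# Hypothetical right-skewed values; in the notebook this would be data["Mortgage"]
mortgage = pd.Series([0, 0, 0, 0, 0, 0, 90, 101, 300, 635])

# A clearly positive skew statistic confirms the long right tail
print(mortgage.skew() > 1)  # True
```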

In [17]:
# Generate Histogram for Mortgage
sns.histplot(data = data, x = 'Mortgage')
plt.title('Mortgage')
plt.xlabel('Mortgage')
plt.show()

# Generate Boxplot for Mortgage
sns.boxplot(data = data, x= 'Mortgage')
plt.title('Boxplot of Mortgage')
plt.xlabel('Mortgage')
plt.show()
2. How many customers have credit cards?
  • Answer: 1470
In [18]:
# How many customers have credit cards?
data.CreditCard.value_counts() # 0: 3530; 1: 1470
Out[18]:
count
CreditCard
0 3530
1 1470

Answers to the following questions appear later in the notebook:

  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

Column Refactoring and Feature Engineering

In [19]:
# Check the number of unique zip codes
data["ZIPCode"].nunique()
Out[19]:
467
In [20]:
# There are quite a few unique zip codes; some feature engineering can reduce them
# One option is to keep only the first two characters, grouping zip codes into broader regions
data["ZIPCode"] = data["ZIPCode"].astype(str).str[:2]
data["ZIPCode"].nunique()
Out[20]:
7
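To see what the two-character bucketing does, here is a small sketch on hypothetical zip codes (the real column's 467 unique values collapse to 7 this way):

```python
import pandas as pd

# Hypothetical zip codes; keeping the first two digits groups them into regions
zips = pd.Series([91107, 90089, 94720, 94112, 91330])
prefixes = zips.astype(str).str[:2]
print(prefixes.nunique(), sorted(prefixes.unique()))  # 3 ['90', '91', '94']
```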
In [21]:
# Some of the columns should act as category fields although they might have numeric data in them
# So, convert the data type of categorical features to 'category'
category_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
]
data[category_cols] = data[category_cols].astype("category")
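Besides signalling intent, the category dtype is cheaper for low-cardinality columns; a sketch comparing deep memory usage on toy data (not the notebook's frame):

```python
import pandas as pd

# A low-cardinality column stored as int64 vs category
s_int = pd.Series([0, 1] * 1000)   # 8 bytes per value
s_cat = s_int.astype("category")   # small integer codes + 2 category values

# Category storage is substantially smaller here
print(s_cat.memory_usage(deep=True) < s_int.memory_usage(deep=True))  # True
```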

Univariate Analysis

In [22]:
# Function to plot both histogram and boxplot for numeric columns

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)

    For Boxplot:
    The marker indicates the mean value of the data

    For Histogram:
    Vertical green dashed line (--) represents the mean/average of the data
    Vertical solid black line (-) represents the median of the data
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram; falls back to automatic binning when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [23]:
num_col = data.select_dtypes(include=np.number).columns.tolist()
num_col
Out[23]:
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage']
In [24]:
# Create Histograms and Boxplots for numeric fields
num_col = data.select_dtypes(include=np.number).columns.tolist()

for item in num_col:
    histogram_boxplot(data, item)

Observations

  • Customer Age is evenly distributed and the median and mean are similar at ~45 years.
  • Customer years of Experience is evenly distributed and the median and mean are similar at ~20 years.
  • Customer Income is right-skewed with an average of ~74K. There are also a number of income outliers on the higher side.
  • At least 25% of the customers have a family size of 3 or more.
  • The average spending on credit cards per month by customers is right-skewed with an average of ~2K. It has a number of outliers on the higher side.
  • The Mortgage data is heavily right-skewed and there are quite a lot of outliers on the higher side of the data.
In [25]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate: center of the bar
        y = p.get_height()  # y-coordinate: height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [26]:
# create labeled barplots for categorical fields
for item in category_cols:
  if item != "Personal_Loan":
    labeled_barplot(data, item, perc=True)

# Family is numeric, so it was not plotted above; this barplot gives a better view of it
labeled_barplot(data, "Family", perc=True)

# Chart the target field - Personal Loan
labeled_barplot(data, "Personal_Loan", perc=True)

Observations

  • ~42% of the customers have an undergraduate degree
  • Majority of the customers, ~90%, do not have securities accounts with the bank
  • Majority of the customers, ~94%, do not have CD accounts with the bank
  • ~60% of customers use online/internet banking
  • ~70% of the customers do not have credit cards with other banks
  • The highest number of customers, ~30%, live in zip codes starting with 94
  • There are more customers with one person in the family than the other groups at ~29%, followed by customers with two and four people at ~26%. Customers with three people in the family are the least at ~20%.
  • ~91% of the customers do not have personal loans. We can say that the data is unbalanced.

Bivariate Analysis

In [27]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), title="Personal Loan")
    plt.show()
In [28]:
### function to plot distributions wrt target

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Check Correlation Amongst Variables

In [29]:
# Analyze the correlation between the numeric variables
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
Out[29]:
<Axes: >

Observations

  • There is a very high positive correlation between Age and Experience.
  • There is a high positive correlation between Income and average monthly credit card spending.
  • There is a slight positive correlation between Income and Mortgage.
  • There is a very slight positive correlation between Mortgage and average monthly credit card spending.
  • There is a very slight negative correlation between Family size and average monthly credit card spending.
  • There is a very slight negative correlation between Family size and Income.
In [30]:
# Scatter plot matrix; pairplot creates its own figure, so no plt.figure call is needed
sns.pairplot(data, vars=num_col, hue='Personal_Loan', diag_kind='kde');

Observation

  • Customers with an income greater than $100K are more likely to get a personal loan.

Check how a Customer's interest in purchasing a loan varies with the categorical variables

In [31]:
for item in category_cols:
  if item != "Personal_Loan":
    stacked_barplot(data, item, "Personal_Loan")
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
Personal_Loan     0    1   All
ZIPCode                       
All            4520  480  5000
94             1334  138  1472
92              894   94   988
95              735   80   815
90              636   67   703
91              510   55   565
93              374   43   417
96               37    3    40
------------------------------------------------------------------------------------------------------------------------

Observations

  • ~96% of customers with undergraduate degrees do not have personal loans; ~87% of customers with Graduate or Advanced/Professional education do not have personal loans.
  • ~89% of customers that have securities accounts do not have personal loans; ~91% of customers that do not have securities accounts do not have personal loans.
  • ~92% of customers who do not have CD accounts do not have personal loans, while ~46% of customers who have CD accounts do have personal loans.
  • ~90% of customers who use internet banking facilities do not have personal loans; the same holds for ~90% of customers who do not use them.
  • ~90% of customers that have credit cards issued by another bank do not have personal loans; the same holds for ~90% of customers that do not.
  • ~90% to ~92% of customers in every ZIP Code group do not have personal loans.

Check how a customer's interest in purchasing a loan varies with the numeric variables

In [32]:
for item in num_col:
  if item != "Personal_Loan":
    distribution_plot_wrt_target(data,  item, "Personal_Loan")

Observations

  • The Age distribution seems evenly distributed for customers who either have personal loans or not. The median of the age of the customers for both groups is ~45 years.

  • The distribution for Experience seems evenly distributed for customers who either have personal loans or not. The median of the experience of the customer for both groups is ~20 years of experience.

  • The data distribution for Income is heavily right-skewed for customers that do not have personal loans, while the distribution is slightly left-skewed for customers that have personal loans.

  • Amongst the customers that do not have personal loans, the median Income is ~$60K and there are a number of outliers. Amongst the customers that have personal loans, the median Income is ~$140K.

  • Amongst those that do not have personal loans, more families have 1 or 2 people, while amongst those that have personal loans, more families have 3 or 4 people. The median of families that do not have personal loans is 2; The median of families that have personal loans is 3.

  • The distribution of CCAvg (average monthly credit card spending) for customers without personal loans is heavily right-skewed, with a number of outliers. The median for this group is ~$1.5K.

  • The distribution of CCAvg for customers with personal loans is right-skewed, also with a number of outliers. The median for this group is ~$4K.

  • The data distribution for Mortgages is heavily right-skewed and there are a lot of outliers for both those that have personal loans and those that do not have personal loans.

  • 25% of customers that do not have personal loans have a mortgage of at least ~$100K; 25% of customers that have personal loans have a mortgage of at least ~$200K.

Questions:

3. What are the attributes that have a strong correlation with the target attribute (personal loan)?

  • CCAvg: customers that have personal loans, have a higher median on monthly average spending on credit cards
  • Family: customers that have personal loans, have a higher median/mean number of family members
  • Income: customers that have personal loans, have a higher median/mean Income
  • CD_Account: amongst customers that have CD accounts, close to 50% have personal loans (versus ~10% of customers overall)

4. How does a customer's interest in purchasing a loan vary with their age?

  • The Age distribution is similar for customers with and without personal loans, and the median age for both groups is ~45 years. So the customer's age does not seem to be a significant factor.

5. How does a customer's interest in purchasing a loan vary with their education?

  • ~96% of customers with undergraduate degrees do not have personal loans, and ~87% of customers with Graduate or Advanced/Professional education do not have personal loans. So most customers do not have a personal loan irrespective of their education background, though conversion is noticeably higher among graduate and advanced/professional customers.

Data Preprocessing

Checking Outliers

In [33]:
# Compute the quartile-based bounds used to flag outliers in the data
Q1 = data.select_dtypes(include=["float64", "int64"]).quantile(0.25)  # To find the 25th percentile and 75th percentile.
Q3 = data.select_dtypes(include=["float64", "int64"]).quantile(0.75)

IQR = Q3 - Q1  # Interquartile Range (75th percentile - 25th percentile)

lower = (
    Q1 - 1.5 * IQR
)  # Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper = Q3 + 1.5 * IQR

# Determine the percentage of outliers in each numeric column
(
    (data.select_dtypes(include=["float64", "int64"]) < lower)
    | (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
Out[33]:
0
Age 0.00
Experience 0.00
Income 1.92
Family 0.00
CCAvg 6.48
Mortgage 5.82

  • The fields with outliers (Income, CCAvg, Mortgage) have relatively few of them, and the values seem valid, so we will leave them as-is.

Data Preparation for Modeling

In [34]:
# Get a copy of the data
data_copy = data.copy()
In [35]:
# Prepare the data for modeling
# Define the explanatory (independent) and response (dependent) variables
X = data.drop(["Personal_Loan"], axis=1)
Y = data["Personal_Loan"]

X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
X = X.astype(float)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
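What `get_dummies(..., drop_first=True)` does to the encoded columns can be seen on a toy column (values hypothetical):

```python
import pandas as pd

# drop_first=True drops one level per column to avoid a redundant dummy
toy = pd.DataFrame({"Education": [1, 2, 3, 1]}).astype("category")
dummies = pd.get_dummies(toy, columns=["Education"], drop_first=True)
print(list(dummies.columns))  # ['Education_2', 'Education_3']
```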
In [36]:
print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape, '\n')
print("Percentage of classes in training set:")
print(100*y_train.value_counts(normalize=True), '\n')
print(y_train.value_counts(normalize=True), '\n')
print("Percentage of classes in test set:")
print(100*y_test.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
Shape of training set: (3500, 18)
Shape of test set: (1500, 18) 

Percentage of classes in training set:
Personal_Loan
0    90.542857
1     9.457143
Name: proportion, dtype: float64 

Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64 

Percentage of classes in test set:
Personal_Loan
0    90.066667
1     9.933333
Name: proportion, dtype: float64
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64
  • Both the training and test sets preserve the class distribution: ~90-91% without personal loans and ~9-10% with personal loans.
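The split preserved the class ratio only approximately; passing `stratify` to `train_test_split` guarantees it. A sketch on a toy imbalanced target (names here are illustrative, not the notebook's variables):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy 90/10 imbalanced target; stratify keeps the ratio exact in both splits
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())  # both ≈ 0.10
```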

Model Building

Model Evaluation Criterion

  • It is more costly to miss out on customers who would take a personal loan but are predicted not to: the lost interest revenue from those loans could be substantial. These are the False Negatives, so Recall is the metric to focus on.
  • Conversely, predicting that a customer will take a loan when they won't is less costly. These are the False Positives, which Precision captures, and they matter less here.
  • Since the goal is to target customers likely to take loans, Recall should be the primary evaluation metric.
  • Use the model_performance_classification_sklearn function (defined below) to check model performance.
  • Use the confusion_matrix_sklearn function (defined below) to plot the confusion matrix.
In [37]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [38]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Decision Tree (sklearn default)

In [39]:
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
Out[39]:
DecisionTreeClassifier(random_state=1)

Check model performance on training data

In [40]:
confusion_matrix_sklearn(model, X_train, y_train)
In [41]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
Out[41]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
  • As expected, the model is perfect on the training data. Let's see if it is overfitted when we check the test data performance.
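A perfect training score is the signature of an unpruned tree memorizing the data. As a sketch of how to quantify this without touching the test set (stand-in data from `make_classification`, not the bank's), cross-validated recall is typically lower than the training recall:

```python
# Sketch: compare training recall of an unpruned tree against cross-validated recall.
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# stand-in imbalanced data (~90/10) so the sketch runs on its own
Xs, ys = make_classification(n_samples=500, weights=[0.9], random_state=1)

clf = DecisionTreeClassifier(random_state=1).fit(Xs, ys)
train_recall = recall_score(ys, clf.predict(Xs))  # unpruned tree memorizes: 1.0
cv_recall = cross_val_score(clf, Xs, ys, cv=5, scoring="recall").mean()
print(train_recall, cv_recall)
```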

Visualizing the Decision Tree

In [42]:
# list of feature names in X_train
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']
In [43]:
# list of feature names in X_train
feature_names = list(X_train.columns)

# set the figure size for the plot
plt.figure(figsize=(20, 30))

# plotting the decision tree
out = tree.plot_tree(
    model,                         # decision tree classifier model
    feature_names=feature_names,    # list of feature names (columns) in the dataset
    filled=True,                    # fill the nodes with colors based on class
    fontsize=9,                     # font size for the node text
    node_ids=False,                 # do not show the ID of each node
    class_names=None,               # whether or not to display class names
)

# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")    # set arrow color to black
        arrow.set_linewidth(1)          # set arrow linewidth to 1

# displaying the plot
plt.show()
  • We can observe that this is a very complex tree.
In [44]:
# Print a text report showing the rules of a decision tree

print(
    tree.export_text(
        model,    # specify the model
        feature_names=feature_names,    # specify the feature names
        show_weights=True    # specify whether or not to show the weights associated with the model
    )
)
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- CCAvg <= 2.20
|   |   |   |   |   |   |   |--- weights: [48.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.20
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |--- Experience <= 11.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Experience >  11.50
|   |   |   |   |   |   |--- Experience <= 16.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience >  16.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- Family >  3.50
|   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |--- Experience >  3.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- Experience <= 7.00
|   |   |   |   |   |   |   |--- ZIPCode_92 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_92 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience >  7.00
|   |   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Experience <= 0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Experience >  0.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Experience <= 13.00
|   |   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Experience >  13.00
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |   |   |--- weights: [26.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage >  104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- Family <= 2.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  2.00
|   |   |   |   |   |   |   |   |   |--- Experience <= 29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Experience >  29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Family <= 2.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 56.50
|   |   |   |   |   |   |   |   |--- weights: [27.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  56.50
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- Age <= 52.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Age >  52.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- Income <= 107.00
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  107.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |--- Family >  2.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- CCAvg <= 4.85
|   |   |   |   |   |   |--- weights: [0.00, 17.00] class: 1
|   |   |   |   |   |--- CCAvg >  4.85
|   |   |   |   |   |   |--- CCAvg <= 4.95
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  4.95
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  57.50
|   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |   |   |--- Experience <= 33.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Experience >  33.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  59.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|--- Income >  116.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [0.00, 53.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [0.00, 62.00] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [0.00, 154.00] class: 1

In [45]:
# Determine the importance of features in the tree building
# (The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
# It is also known as the Gini importance )

print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.306430
Family              0.258143
Education_2         0.168695
Education_3         0.147127
CCAvg               0.045718
CD_Account          0.021361
Age                 0.019531
Experience          0.018229
ZIPCode_94          0.006488
Mortgage            0.003236
Online              0.002224
ZIPCode_92          0.002224
ZIPCode_93          0.000594
Securities_Account  0.000000
ZIPCode_91          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
CreditCard          0.000000
In [46]:
# Plot the chart to show the importance of features
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

We can see the order of importance of the columns: Income, Family, Education, CCAvg, CD_Account, Age, Experience, then the rest.
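Impurity-based (Gini) importances can overstate features with many split points; a common cross-check is permutation importance. A minimal sketch on stand-in data (not the bank's):

```python
# Sketch: permutation importance shuffles one feature at a time and measures
# the drop in model score, giving a model-agnostic importance estimate.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

Xs, ys = make_classification(n_samples=300, n_features=5, random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(Xs, ys)

result = permutation_importance(clf, Xs, ys, n_repeats=5, random_state=1)
print(result.importances_mean)  # one score per feature
```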

Checking model performance on test data

In [47]:
confusion_matrix_sklearn(model, X_test, y_test)
In [48]:
decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test
Out[48]:
Accuracy Recall Precision F1
0 0.986667 0.912752 0.951049 0.931507
  • The training and test scores differ, as expected, but the test scores still look quite good.

From the analysis and the feature importances, Age and Experience appear to have a very similar effect on the model (they are strongly correlated). So, let's drop Experience and rebuild the model, going through the same process as above.
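The Age-Experience overlap can be confirmed numerically before dropping a column. A small sketch (a hypothetical mini-frame standing in for `data_copy`):

```python
# Sketch: a correlation near 1.0 means the two columns carry redundant information.
import pandas as pd

df = pd.DataFrame({"Age": [25, 35, 45, 55], "Experience": [1, 11, 21, 31]})
print(df["Age"].corr(df["Experience"]))  # 1.0 for a perfectly linear relationship
```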

In [49]:
data_copy.columns
Out[49]:
Index(['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard'],
      dtype='object')
In [50]:
# Use the same starting point data
data = data_copy

# Drop Experience as it is correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]

X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
X = X.astype(float)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [51]:
print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape, '\n')
print("Percentage of classes in training set:")
print(100*y_train.value_counts(normalize=True), '\n')
print(y_train.value_counts(normalize=True), '\n')
print("Percentage of classes in test set:")
print(100*y_test.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
Shape of training set: (3500, 17)
Shape of test set: (1500, 17) 

Percentage of classes in training set:
Personal_Loan
0    90.542857
1     9.457143
Name: proportion, dtype: float64 

Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64 

Percentage of classes in test set:
Personal_Loan
0    90.066667
1     9.933333
Name: proportion, dtype: float64
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64
In [52]:
model1 = DecisionTreeClassifier(criterion="gini", random_state=1)
model1.fit(X_train, y_train)
Out[52]:
DecisionTreeClassifier(random_state=1)
In [53]:
confusion_matrix_sklearn(model1, X_train, y_train)
In [54]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model1, X_train, y_train
)
decision_tree_perf_train
Out[54]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [55]:
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']
In [56]:
# list of feature names in X_train
feature_names = list(X_train.columns)

# set the figure size for the plot
plt.figure(figsize=(20, 30))

# plotting the decision tree
out = tree.plot_tree(
    model1,                         # decision tree classifier model
    feature_names=feature_names,    # list of feature names (columns) in the dataset
    filled=True,                    # fill the nodes with colors based on class
    fontsize=9,                     # font size for the node text
    node_ids=False,                 # do not show the ID of each node
    class_names=None,               # whether or not to display class names
)

# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")    # set arrow color to black
        arrow.set_linewidth(1)          # set arrow linewidth to 1

# displaying the plot
plt.show()
In [57]:
print(tree.export_text(model1, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- CCAvg <= 2.20
|   |   |   |   |   |   |   |--- weights: [48.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.20
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Age >  37.50
|   |   |   |   |   |   |--- Income <= 112.00
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Income >  112.00
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- Family >  3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.40
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  33.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  37.50
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |   |   |--- weights: [26.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage >  104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCode_91 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Family <= 2.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 56.50
|   |   |   |   |   |   |   |   |--- weights: [27.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  56.50
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- Income <= 107.00
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  107.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |--- Family >  2.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- CCAvg <= 4.85
|   |   |   |   |   |   |--- weights: [0.00, 17.00] class: 1
|   |   |   |   |   |--- CCAvg >  4.85
|   |   |   |   |   |   |--- CCAvg <= 4.95
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  4.95
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  57.50
|   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  59.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|--- Income >  116.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [0.00, 53.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [0.00, 62.00] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [0.00, 154.00] class: 1

In [58]:
# Determine the importance of features in the tree building
# (The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
# It is also known as the Gini importance )

print(
    pd.DataFrame(
        model1.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.308098
Family              0.259255
Education_2         0.166192
Education_3         0.147127
CCAvg               0.048798
Age                 0.033150
CD_Account          0.017273
ZIPCode_94          0.007183
ZIPCode_93          0.004682
Mortgage            0.003236
Online              0.002224
Securities_Account  0.002224
ZIPCode_91          0.000556
ZIPCode_92          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
CreditCard          0.000000
In [59]:
# Plot the chart to show the importance of features
importances = model1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • The importances are largely the same as when Experience was a column in the data, with a few differences: for example, Age is now more important than CD_Account.
In [60]:
confusion_matrix_sklearn(model1, X_test, y_test)
In [61]:
decision_tree_perf_test = model_performance_classification_sklearn(model1, X_test, y_test)
decision_tree_perf_test
Out[61]:
Accuracy Recall Precision F1
0 0.986 0.932886 0.926667 0.929766
  • The Recall score is higher with the Experience column removed, while other scores are lower.
  • I thought it would be fun to experiment. Now, let's move along!

Model Performance Improvement

Pre-Pruning

In [62]:
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]

'''
# I tried these parameters also and
#   max_depth_values=2, max_leaf_nodes_values=10, min_samples_split_values=10, produced a Recall of 1
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = np.arange(10, 51, 10)
min_samples_split_values = np.arange(10, 51, 10)
'''

# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=42
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # Calculate recall scores for training and test sets
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)

            # Calculate the absolute difference between training and test recall scores
            score_diff = abs(train_recall_score - test_recall_score)

            # Keep the current model only if it both narrows the train/test recall gap
            # and improves the test recall
            if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator

# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 2
Max leaf nodes: 50
Min samples split: 10
Best test recall score: 1.0
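The manual triple loop above can also be expressed with scikit-learn's GridSearchCV, scoring on recall directly. A sketch on stand-in data (parameter values are illustrative, not the ones tuned above):

```python
# Sketch: grid search over pre-pruning parameters, optimizing cross-validated recall.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

Xs, ys = make_classification(n_samples=500, weights=[0.9], random_state=1)

param_grid = {
    "max_depth": [2, 4, 6],
    "max_leaf_nodes": [50, 75, 150],
    "min_samples_split": [10, 30, 50],
}
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # optimize the metric we care about
    cv=5,
)
search.fit(Xs, ys)
print(search.best_params_)
```

One design difference: GridSearchCV selects on cross-validated recall rather than on the test set, which avoids tuning to the test data.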
In [63]:
# Fit the best algorithm to the data.
estimator = best_estimator
estimator.fit(X_train, y_train)
Out[63]:
DecisionTreeClassifier(class_weight='balanced', max_depth=2, max_leaf_nodes=50,
                       min_samples_split=10, random_state=42)

Model Evaluation

In [64]:
confusion_matrix_sklearn(estimator, X_train, y_train)
In [65]:
dtree_pre_pruning_train_perf = model_performance_classification_sklearn(
    estimator, X_train, y_train
)
dtree_pre_pruning_train_perf
Out[65]:
Accuracy Recall Precision F1
0 0.790286 1.0 0.310798 0.474212

Visualizing the Decision Tree

In [66]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
  • The complexity of the tree was drastically reduced compared to the default model.
In [67]:
# Print text report showing the rules of a decision tree -

print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [64.61, 79.31] class: 1
|--- Income >  92.50
|   |--- Family <= 2.50
|   |   |--- weights: [298.20, 697.89] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [42.52, 972.81] class: 1

In [68]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.876529
CCAvg               0.066940
Family              0.056531
Age                 0.000000
ZIPCode_92          0.000000
Education_2         0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
ZIPCode_94          0.000000
ZIPCode_93          0.000000
CreditCard          0.000000
ZIPCode_91          0.000000
Online              0.000000
CD_Account          0.000000
Securities_Account  0.000000
Mortgage            0.000000
Education_3         0.000000
In [69]:
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • From this Pre-Pruned model, Income, CCAvg and Family are the most important variables affecting the probability of getting a personal loan.

Check Performance of Test Data

In [70]:
confusion_matrix_sklearn(estimator, X_test, y_test)
In [71]:
dtree_pre_pruning_test_perf = model_performance_classification_sklearn(
    estimator, X_test, y_test
)
dtree_pre_pruning_test_perf
Out[71]:
Accuracy Recall Precision F1
0 0.779333 1.0 0.310417 0.473768

Observations For Pre-Pruning

  • Interestingly, with these model parameters: Max depth: 2, Max leaf nodes: 50, Min samples split: 10, we see that the Recall for both the training and test data is 1. That is, there are no False Negatives. Also, it is note worthy that the Accuracy, Precision and F1 scores are not too far off.

  • I also tried some other parameters and found at least one other model that gave a Recall of 1. There parameters are: Max depth: 2, Max leaf nodes: 10, Min samples split: 10

  • I am not comfortable with the fact that this model gives a Recall score of 1, which looks perfect for our purposes. It is too perfect! The fact that the Accuracy, Precision, and F1 scores are quite low is also suspect. Lastly, a rule of thumb from our instructor is that Recall and Precision should be within about 20% of each other.

  • I will therefore try other parameter combinations next, and move through these more quickly.

In [72]:
# Define the parameter values to iterate over
# (class_weight is fixed to 'balanced' for every candidate tree)

max_depth_values = [3, 5, 7, 9, 10]
max_leaf_nodes_values = [50, 100, 200, 500, 1000]
min_samples_split_values = [2, 10, 50, 100]

# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimatorZ = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=42
            )

            # Fit the model to the training data
            estimatorZ.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimatorZ.predict(X_train)
            y_test_pred = estimatorZ.predict(X_test)

            # Calculate recall scores for training and test sets
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)

            # Calculate the absolute difference between training and test recall scores
            score_diff = abs(train_recall_score - test_recall_score)

            # Keep the current model if it both narrows the train-test recall
            # gap and improves on the best test recall seen so far
            if score_diff < best_score_diff and test_recall_score > best_test_score:
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimatorZ

# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 5
Max leaf nodes: 50
Min samples split: 100
Best test recall score: 0.9798657718120806
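As an aside, the same kind of search can be expressed with scikit-learn's `GridSearchCV`. The sketch below is illustrative only: it optimizes cross-validated Recall rather than the train/test recall-gap criterion used in the loop above, and it uses synthetic data in place of `X_train`/`y_train`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (~10% positives)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

param_grid = {
    "max_depth": [3, 5, 7, 9, 10],
    "max_leaf_nodes": [50, 100, 200],
    "min_samples_split": [2, 10, 50, 100],
}

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # optimize Recall, as in the hand-written loop
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

`GridSearchCV` refits the best estimator on the full data, so `grid.best_estimator_` can be used directly afterwards.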
In [73]:
# Fit the best algorithm to the data.
estimator1 = best_estimator
estimator1.fit(X_train, y_train)
Out[73]:
DecisionTreeClassifier(class_weight='balanced', max_depth=5, max_leaf_nodes=50,
                       min_samples_split=100, random_state=42)
In [74]:
confusion_matrix_sklearn(estimator1, X_train, y_train)
In [75]:
dtree_pre_pruning_train_perf1 = model_performance_classification_sklearn(
    estimator1, X_train, y_train
)
dtree_pre_pruning_train_perf1
Out[75]:
Accuracy Recall Precision F1
0 0.938286 0.990937 0.606285 0.752294
In [76]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator1,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [77]:
# Print text report showing the rules of a decision tree -

print(tree.export_text(estimator1, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- weights: [41.42, 52.87] class: 1
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [23.19, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [0.00, 26.44] class: 1
|--- Income >  92.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- weights: [24.85, 15.86] class: 0
|   |   |   |   |--- Income >  103.50
|   |   |   |   |   |--- weights: [239.11, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [16.57, 306.65] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- weights: [17.67, 47.58] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 327.79] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 113.50
|   |   |   |--- weights: [39.21, 126.89] class: 1
|   |   |--- Income >  113.50
|   |   |   |--- Age <= 66.00
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- weights: [2.76, 31.72] class: 1
|   |   |   |   |--- Income >  116.50
|   |   |   |   |   |--- weights: [0.00, 814.20] class: 1
|   |   |   |--- Age >  66.00
|   |   |   |   |--- weights: [0.55, 0.00] class: 0

In [78]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        estimator1.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.683592
Education_2         0.152816
CCAvg               0.058040
Education_3         0.053806
Family              0.042686
CD_Account          0.008357
Age                 0.000702
Online              0.000000
Securities_Account  0.000000
ZIPCode_91          0.000000
ZIPCode_92          0.000000
ZIPCode_93          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Mortgage            0.000000
CreditCard          0.000000
In [79]:
importances = estimator1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
In [80]:
confusion_matrix_sklearn(estimator1, X_test, y_test)
In [81]:
dtree_pre_pruning_test_perf1 = model_performance_classification_sklearn(
    estimator1, X_test, y_test
)
dtree_pre_pruning_test_perf1
Out[81]:
Accuracy Recall Precision F1
0 0.934667 0.979866 0.605809 0.748718
  • From this Pre-Pruned model, Income, Education, CCAvg, Family and CD_Account are the most important variables affecting the probability of getting a personal loan.
  • Frankly speaking, I am more comfortable with this model: the Recall is still very high, while the Accuracy, Precision, and F1 scores improved dramatically over the prior model that had a Recall of 1.
  • This model is good with test data and also generalizes from training to test data for our purposes.

Post Pruning

In [82]:
# Create an instance of the decision tree model
clf = DecisionTreeClassifier(random_state=1)

# Compute the cost complexity pruning path for the model using the training data
path = clf.cost_complexity_pruning_path(X_train, y_train)

# Extract the array of effective alphas from the pruning path
# (abs() guards against tiny negative values from floating-point error)
ccp_alphas = abs(path.ccp_alphas)

# Extract the array of total impurities at each alpha along the pruning path
impurities = path.impurities
In [76]:
pd.DataFrame(path)
Out[76]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000186 0.001114
2 0.000214 0.001542
3 0.000242 0.002750
4 0.000250 0.003250
5 0.000268 0.004324
6 0.000272 0.004868
7 0.000276 0.005420
8 0.000381 0.005801
9 0.000527 0.006329
10 0.000625 0.006954
11 0.000700 0.007654
12 0.000769 0.010731
13 0.000882 0.014260
14 0.000889 0.015149
15 0.001026 0.017200
16 0.001305 0.018505
17 0.001647 0.020153
18 0.002333 0.022486
19 0.002407 0.024893
20 0.003294 0.028187
21 0.006473 0.034659
22 0.025146 0.084951
23 0.039216 0.124167
24 0.047088 0.171255
In [83]:
# Create a figure
fig, ax = plt.subplots(figsize=(10, 5))

# Plot the total impurities versus effective alphas, excluding the last value,
# using markers at each data point and connecting them with steps
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")

# Set the x-axis label
ax.set_xlabel("Effective Alpha")

# Set the y-axis label
ax.set_ylabel("Total impurity of leaves")

# Set the title of the plot
ax.set_title("Total Impurity vs Effective Alpha for training set");
  • Next, we train a decision tree using the effective alphas.

  • The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the corresponding tree with one node.

In [84]:
# Initialize an empty list to store the decision tree classifiers
clfs = []

# Iterate over each ccp_alpha value extracted from cost complexity pruning path
for ccp_alpha in ccp_alphas:
    # Create an instance of the DecisionTreeClassifier
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=1)

    # Fit the classifier to the training data
    clf.fit(X_train, y_train)

    # Append the trained classifier to the list
    clfs.append(clf)

# Print the number of nodes in the last tree along with its ccp_alpha value
print(
    "Number of nodes in the last tree is {} with ccp_alpha {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is 1 with ccp_alpha 0.04708834100596766
In [85]:
# Remove the last classifier and corresponding ccp_alpha value from the lists
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

# Extract the number of nodes in each tree classifier
node_counts = [clf.tree_.node_count for clf in clfs]

# Extract the maximum depth of each tree classifier
depth = [clf.tree_.max_depth for clf in clfs]

# Create a figure and a set of subplots
fig, ax = plt.subplots(2, 1, figsize=(10, 7))

# Plot the number of nodes versus ccp_alphas on the first subplot
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("Alpha")
ax[0].set_ylabel("Number of nodes")
ax[0].set_title("Number of nodes vs Alpha")

# Plot the depth of tree versus ccp_alphas on the second subplot
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("Alpha")
ax[1].set_ylabel("Depth of tree")
ax[1].set_title("Depth vs Alpha")

# Adjust the layout of the subplots to avoid overlap
fig.tight_layout()
In [86]:
train_recall = []  # Initialize an empty list to store Recall scores for training set for each decision tree classifier

# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
    # Predict labels for the training set using the current decision tree classifier
    pred_train = clf.predict(X_train)

    # Calculate the recall score for the training set predictions compared to true labels
    train_recall_score = recall_score(y_train, pred_train)

    # Append the calculated recall score to the train_recall list
    train_recall.append(train_recall_score)
In [87]:
test_recall = []  # Initialize an empty list to store Recall scores for test set for each decision tree classifier

# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
    # Predict labels for the test set using the current decision tree classifier
    pred_test = clf.predict(X_test)

    # Calculate the recall score for the test set predictions compared to true labels
    recall_test_score = recall_score(y_test, pred_test)

    # Append the calculated recall score to the test_recall_scores list
    test_recall.append(recall_test_score)
In [88]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]  # Calculate training scores for each decision tree classifier
test_scores = [clf.score(X_test, y_test) for clf in clfs]  # Calculate testing scores for each decision tree classifier
In [89]:
# Put the recall scores and alphas side by side in a DataFrame
pd.DataFrame({'test_recall': test_recall, 'train_recall': train_recall, 'ccp_alphas': ccp_alphas})
Out[89]:
test_recall train_recall ccp_alphas
0 0.932886 1.000000 0.000000
1 0.926174 0.993958 0.000186
2 0.919463 0.990937 0.000214
3 0.906040 0.984894 0.000242
4 0.906040 0.981873 0.000250
5 0.892617 0.975831 0.000268
6 0.892617 0.975831 0.000272
7 0.892617 0.972810 0.000276
8 0.892617 0.972810 0.000381
9 0.892617 0.969789 0.000527
10 0.879195 0.963746 0.000625
11 0.879195 0.957704 0.000700
12 0.865772 0.939577 0.000769
13 0.852349 0.921450 0.000882
14 0.845638 0.915408 0.000889
15 0.838926 0.900302 0.001026
16 0.805369 0.888218 0.001305
17 0.825503 0.897281 0.001647
18 0.805369 0.882175 0.002333
19 0.751678 0.812689 0.002407
20 0.751678 0.812689 0.003294
21 0.751678 0.812689 0.006473
22 0.402685 0.465257 0.025146
23 0.000000 0.000000 0.039216
In [90]:
# Create a figure
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("Alpha")  # Set the label for the x-axis
ax.set_ylabel("Recall Score")  # Set the label for the y-axis
ax.set_title("Recall Score vs Alpha for training and test sets")  # Set the title of the plot

# Plot the training Recall scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, train_recall, marker="o", label="training", drawstyle="steps-post")

# Plot the testing Recall scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, test_recall, marker="o", label="test", drawstyle="steps-post")

ax.legend();  # Add a legend to the plot
In [91]:
# creating the model where we get highest test Recall Score
index_best_model = np.argmax(test_recall)

# selecting the decision tree model corresponding to the highest test score
dtree_post_pruning = clfs[index_best_model]
print(dtree_post_pruning)
DecisionTreeClassifier(random_state=1)

Check Performance on Training Data

In [92]:
confusion_matrix_sklearn(dtree_post_pruning, X_train, y_train)
In [93]:
dtree_post_pruning_train_perf = model_performance_classification_sklearn(
    dtree_post_pruning, X_train, y_train
)
dtree_post_pruning_train_perf
Out[93]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
  • This Post-Pruned model looks like it might be overfitted (all training scores are 1). That makes sense: the highest test Recall occurred at ccp_alpha = 0, so the selected "pruned" tree is actually the unpruned one. We will confirm shortly when we check the test data.

Visualize the Decision Tree

In [94]:
plt.figure(figsize=(10, 13))
out = tree.plot_tree(
    dtree_post_pruning,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [95]:
# importance of features in the tree building
#( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        dtree_post_pruning.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.308098
Family              0.259255
Education_2         0.166192
Education_3         0.147127
CCAvg               0.048798
Age                 0.033150
CD_Account          0.017273
ZIPCode_94          0.007183
ZIPCode_93          0.004682
Mortgage            0.003236
Online              0.002224
Securities_Account  0.002224
ZIPCode_91          0.000556
ZIPCode_92          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
CreditCard          0.000000
In [96]:
importances = dtree_post_pruning.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • From this Post-Pruned model, Income, Family, Education, CCAvg, Age, and CD_Account (along with some other, less significant variables) affect the probability of taking a personal loan.

Check performance on test data

In [97]:
confusion_matrix_sklearn(dtree_post_pruning, X_test, y_test)
In [98]:
dtree_post_pruning_test_perf = model_performance_classification_sklearn(
    dtree_post_pruning, X_test, y_test
)
dtree_post_pruning_test_perf
Out[98]:
Accuracy Recall Precision F1
0 0.986 0.932886 0.926667 0.929766
  • Wow! The model did very well on the test data, so it is not overfitting as we might have thought.
  • I still don't like the fact that the performance on the training data was all 1s. So, I would like to experiment with the ccp_alpha at which the test Recall briefly recovers after its first dip: ccp_alpha = 0.001647 from the table above.
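Rather than reading the alpha off the table by eye, the choice can be sketched programmatically. The helper below is a hypothetical illustration: the alpha/recall values are rounded from the table above, and the 0.82 recall floor is an assumed cutoff:

```python
def pick_alpha(alphas, recalls, floor=0.9):
    """Largest ccp_alpha whose test Recall is still >= floor (None if none)."""
    candidates = [a for a, r in zip(alphas, recalls) if r >= floor]
    return max(candidates) if candidates else None

# Rounded values taken from the recall-vs-alpha table above
alphas = [0.0, 0.000625, 0.001026, 0.001647, 0.002407]
recalls = [0.933, 0.879, 0.839, 0.826, 0.752]
print(pick_alpha(alphas, recalls, floor=0.82))  # 0.001647
```

This trades a little Recall for a much smaller tree, which is exactly the pruning compromise being explored here.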
In [102]:
# Create a model with the chosen ccp_alpha value,
# weighting the classes with proportions close to the inverse of the outcome frequencies
dtree_post_pruning1 = DecisionTreeClassifier(
    ccp_alpha=0.001647, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
dtree_post_pruning1.fit(X_train, y_train)
Out[102]:
DecisionTreeClassifier(ccp_alpha=0.001647, class_weight={0: 0.15, 1: 0.85},
                       random_state=1)
In [103]:
confusion_matrix_sklearn(dtree_post_pruning1, X_train, y_train)
In [104]:
decision_tree_tune_post_train1 = model_performance_classification_sklearn(dtree_post_pruning1, X_train, y_train)
decision_tree_tune_post_train1
Out[104]:
Accuracy Recall Precision F1
0 0.979429 0.987915 0.827848 0.900826
  • Way better, in my opinion, than the model with all 1s on train data.
In [105]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    dtree_post_pruning1,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [106]:
# Text report showing the rules of a decision tree -

print(tree.export_text(dtree_post_pruning1, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- weights: [7.35, 2.55] class: 0
|   |   |   |   |--- Income >  81.50
|   |   |   |   |   |--- weights: [4.35, 9.35] class: 1
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [0.15, 6.80] class: 1
|--- Income >  98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |--- weights: [0.45, 1.70] class: 1
|   |   |   |   |--- Income >  100.00
|   |   |   |   |   |--- weights: [67.20, 0.85] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |--- weights: [1.80, 0.00] class: 0
|   |   |   |   |--- Income >  110.00
|   |   |   |   |   |--- weights: [1.05, 47.60] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [1.95, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- weights: [1.50, 6.80] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 52.70] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 113.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [3.90, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- weights: [1.65, 5.10] class: 1
|   |   |   |--- CCAvg >  2.75
|   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |--- weights: [0.15, 11.90] class: 1
|   |   |   |   |--- Age >  57.00
|   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |--- Income >  113.50
|   |   |   |--- weights: [0.90, 136.00] class: 1

In [109]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        dtree_post_pruning1.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.626488
Education_2         0.145538
CCAvg               0.075287
Education_3         0.069879
Family              0.063079
CD_Account          0.011712
Age                 0.008017
Online              0.000000
Securities_Account  0.000000
ZIPCode_91          0.000000
ZIPCode_92          0.000000
ZIPCode_93          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Mortgage            0.000000
CreditCard          0.000000
In [110]:
importances = dtree_post_pruning1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • From this second Post-Pruned model, Income, Education, CCAvg, Family, CD_Account and Age are the most important variables affecting the probability of getting a personal loan.
In [111]:
confusion_matrix_sklearn(dtree_post_pruning1, X_test, y_test)
In [112]:
decision_tree_tune_post_test1 = model_performance_classification_sklearn(dtree_post_pruning1, X_test, y_test)
decision_tree_tune_post_test1
Out[112]:
Accuracy Recall Precision F1
0 0.970667 0.919463 0.810651 0.861635
  • This model does well in terms of overfitting and generalization.

Model Performance Comparison and Final Model Selection

In [114]:
# training performance comparison

models_train_comp_df = pd.concat(
    [decision_tree_perf_train.T, dtree_pre_pruning_train_perf.T, dtree_pre_pruning_train_perf1.T, dtree_post_pruning_train_perf.T, decision_tree_tune_post_train1.T], axis=1,
)
models_train_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning: 2,50,10)", "Decision Tree (Pre-Pruning 1: 5,50,100)", "Decision Tree (Post-Pruning: Recall 1)", "Decision Tree (Post-Pruning: Recall < 1, ccp_alpha = 0.001647)"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[114]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning: 2,50,10) Decision Tree (Pre-Pruning 1: 5,50,100) Decision Tree (Post-Pruning: Recall 1) Decision Tree (Post-Pruning: Recall < 1, ccp_alpha = 0.001647)
Accuracy 1.0 0.790286 0.938286 1.0 0.979429
Recall 1.0 1.000000 0.990937 1.0 0.987915
Precision 1.0 0.310798 0.606285 1.0 0.827848
F1 1.0 0.474212 0.752294 1.0 0.900826
In [115]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [decision_tree_perf_test.T, dtree_pre_pruning_test_perf.T, dtree_pre_pruning_test_perf1.T, dtree_post_pruning_test_perf.T, decision_tree_tune_post_test1.T], axis=1,
)
models_test_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning: 2,50,10)", "Decision Tree (Pre-Pruning 1: 5,50,100)", "Decision Tree (Post-Pruning: Recall 1)", "Decision Tree (Post-Pruning: Recall < 1, ccp_alpha = 0.001647)"]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[115]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning: 2,50,10) Decision Tree (Pre-Pruning 1: 5,50,100) Decision Tree (Post-Pruning: Recall 1) Decision Tree (Post-Pruning: Recall < 1, ccp_alpha = 0.001647)
Accuracy 0.986000 0.779333 0.934667 0.986000 0.970667
Recall 0.932886 1.000000 0.979866 0.932886 0.919463
Precision 0.926667 0.310417 0.605809 0.926667 0.810651
F1 0.929766 0.473768 0.748718 0.929766 0.861635
  • The pre-pruned decision tree with parameters (max_depth=2, max_leaf_nodes=50, min_samples_split=10), like the variant with max_leaf_nodes=10, gives the highest Recall score of 1 on both training and test data, so this would be the model to go with. However, the Accuracy, Precision, and F1 scores are nothing to write home about; if they are manageable for the client's purposes, then this model would be acceptable.
  • Note: this model does not appear production-ready, even with a Recall of 1 on both training and test data, because of the low values of the other scores (by the rule of thumb from our MLS instructor, Recall and Precision should be within about 20% of each other).

  • The model with the next highest Recall on the train and test data is the second pre-pruned model, with parameters (max_depth=5, max_leaf_nodes=50, min_samples_split=100). It has far better Accuracy, Precision, and F1 scores than the first pre-pruned model, although the latter two are still somewhat sub-standard in my opinion. If those numbers are acceptable to the client, this model could work.

  • The default model and the first post-pruned model produce exactly the same performance scores. This is because the highest test Recall occurred at ccp_alpha = 0, so the "post-pruned" model selected is in fact the unpruned default tree; no pruning was actually applied. Its Recall is not as high as the first two models discussed, but it is high enough, and the other scores are very high, so it would not be a bad model to use since it generalizes well. It is, however, the most complex tree, which is a point to consider.

  • Income is the most important feature affecting whether a customer will take a personal loan; this holds across all models. The other notable features are CCAvg and Family with the chosen model, and Education with the other models.

Predicting on a single data point

In [245]:
%%time
# Choosing a data point
#applicant_details = X_test.iloc[:1, :] # Does not Have Personal Loan
applicant_details = X_test.iloc[8:9, :] # Has Personal Loan

#print(applicant_details)
# making a decision
approval_prediction = estimator.predict(applicant_details)

print(approval_prediction)
[1]
CPU times: user 5.9 ms, sys: 0 ns, total: 5.9 ms
Wall time: 16.1 ms
In [246]:
%%time
# Predict the likelihood
approval_likelihood = estimator.predict_proba(applicant_details)

#print(approval_likelihood)
print(approval_likelihood[0, 1])
0.9581207493472534
CPU times: user 5.1 ms, sys: 0 ns, total: 5.1 ms
Wall time: 5.84 ms

Further Observation

I located an actual data point that has a 1 in the Personal_Loan column and ran predictions with every model I created. ONLY the model we decided to go with correctly predicted that this customer would take a personal loan, and it estimated the likelihood at a high ~96%. All the other models predicted otherwise. This validates the model we chose for this purpose, regardless of the values of the other scores. It is also significant that a process that might take hours to accomplish manually completes in milliseconds.

Actionable Insights and Business Recommendations

  • What recommendations would you suggest to the bank?

From the decision tree of the chosen model, we observe that if income is greater than 92.5K and the family size is 3 or more (the split is at 2.5), the customer is very likely to take a personal loan.

  • So, the AllLife bank should focus on this category of customers for their campaigns.

The bank can deploy the model for the process of determining the obvious cases of which customer will or will not get a personal loan, leaving the rather non-obvious cases to be handled manually by the experts. So, this can serve as an initial screening done in an automated manner.

The bank can use the model's predicted likelihood as a confidence factor to decide whether a customer should be approved for the personal loan automatically or whether further scrutiny is needed, based on a predetermined threshold.
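The thresholding idea can be sketched as follows; the 0.9 and 0.1 cutoffs are illustrative assumptions, not tuned values:

```python
def screen(probabilities, approve_at=0.9, decline_at=0.1):
    """Map each P(loan) to 'approve', 'decline', or 'manual review'.

    Only confident predictions are decided automatically; the rest are
    routed to the experts, as described above.
    """
    decisions = []
    for p in probabilities:
        if p >= approve_at:
            decisions.append("approve")
        elif p <= decline_at:
            decisions.append("decline")
        else:
            decisions.append("manual review")
    return decisions

# e.g. likelihoods obtained from model.predict_proba(X)[:, 1]
probs = [0.96, 0.05, 0.55]
print(screen(probs))  # ['approve', 'decline', 'manual review']
```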

This process (using the model) will reduce the bank's workload and shorten the turnaround time, both for initial screening and for the whole process of determining which customers will or will not get a personal loan.